Creating an RDD from an existing RDD

As we discussed earlier, an RDD is immutable, so we cannot change it once it is created. Instead, we create new RDDs from existing ones. This process of deriving one dataset from another is called a transformation, and a transformation therefore always produces a new RDD. Because RDDs are immutable, no changes take place in an RDD after it is created, and this property helps maintain consistency across the cluster.

// create an RDD of words, then map each word to a (word, length) pair
val rdd = sc.parallelize(List("animal", "human", "bird", "rat"))
// on a String, size and length return the same value, so either works here
val rdd1 = rdd.map(x => (x, x.length))
rdd1.collect.foreach(println)


// read a text file; map is lazy, so nothing runs until the reduce action below
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)


From an existing RDD
// rdd1 holds (String, Int) pairs, so we can transform the value of each pair
val rdd3 = rdd1.map(row => (row._1, row._2 + 100))

From an existing DataFrame or Dataset
To convert a Dataset or DataFrame to an RDD, just call the rdd method on it.
val myRdd2 = spark.range(20).toDF().rdd
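The rdd method behaves slightly differently depending on whether you call it on a typed Dataset or an untyped DataFrame. A minimal sketch (assuming a running SparkSession named spark, as in spark-shell):

```scala
val ds = spark.range(5)        // Dataset[java.lang.Long]
val rddTyped = ds.rdd          // RDD[java.lang.Long] -- keeps the element type
val rddRows = ds.toDF().rdd    // RDD[Row] -- untyped Row objects
```

Calling rdd on a Dataset preserves the element type, while a DataFrame yields Row objects whose fields must be accessed by index or name.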
